GridSearchCV evaluates each combination of values in a given parameter grid by cross-validation. This means the model is fit on only a portion of the training data at any one time. When we filled in the NA values with the mean, however, we computed that mean over the whole set of training data.
Hence we took an inconsistent approach: only a portion of the data was considered when running GridSearchCV, but the full set of data was considered when filling in missing values. We can avoid this inconsistency by building a pipeline that performs the imputation on the same portion of the data that the model is fit on.
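To make the inconsistency concrete, the sketch below compares the mean over all rows with the mean over the training rows of a single fold. The ages are made up (they are not the Titanic data), and the fold split and variable names are illustrative assumptions.

```python
import numpy as np

# Made-up ages; np.nan marks a missing value (illustrative, not Titanic data).
ages = np.array([22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0, 14.0, 58.0])

# Hypothetical fold: rows 0 to 7 are used for training, rows 8 and 9 are held out.
train_idx = np.arange(0, 8)

# Mean over every row: this also uses the held-out rows.
full_mean = np.nanmean(ages)

# Mean over the training rows of this fold only.
fold_mean = np.nanmean(ages[train_idx])

print(full_mean, fold_mean)
```

The two means differ because the full-data mean has seen the held-out rows. Imputing with it lets information from the validation portion influence training.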
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('../data/train.csv')
We will leave the NA values in the column Age.
In [2]:
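# Drop free-text columns that we do not use as features.
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
# Record the mean age now, while the missing entries are still NaN (mean() skips them).
age_mean = df['Age'].mean()
# Fill missing ports of embarkation with the most common port.
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)
# Encode Sex as an integer and Embarked as dummy columns.
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
pd.get_dummies(df['Embarked'], prefix='Embarked').head(10)
df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)
df = df.drop(['Sex', 'Embarked'], axis=1)
# Move Survived to the first column, followed by PassengerId.
cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]
df = df[cols]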
We replace the NA values in the column Age with a negative marker value, -1, because a bug prevents us from using a missing value marker here.
In [3]:
df = df.fillna(-1)
We then review our dataset.
In [4]:
df.info()
In [5]:
train_data = df.values
We now build a pipeline that first imputes the mean of the column Age using only the portion of the training data under consideration, and then assesses the performance of our tuning parameters.
In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
imputer = Imputer(strategy='mean', missing_values=-1)
classifier = RandomForestClassifier(n_estimators=100)
pipeline = Pipeline([
('imp', imputer),
('clf', classifier),
])
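In recent scikit-learn releases, Imputer and the sklearn.grid_search module are no longer available; SimpleImputer in sklearn.impute and GridSearchCV in sklearn.model_selection take their place. The sketch below builds the same two-step pipeline with SimpleImputer on a made-up feature matrix. The toy data and the smaller n_estimators are assumptions for illustration only.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix; -1 marks a missing value in the first column (made up).
X = np.array([[20.0, 1.0], [-1.0, 0.0], [40.0, 1.0],
              [30.0, 0.0], [-1.0, 1.0], [60.0, 0.0]])
y = np.array([0, 1, 0, 1, 0, 1])

pipeline = Pipeline([
    ('imp', SimpleImputer(strategy='mean', missing_values=-1)),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X, y)

# Column means learned by the imputer, computed without the -1 markers.
learned_means = pipeline.named_steps['imp'].statistics_
print(learned_means)
```

Because the imputer is a step of the pipeline, its means are recomputed from whatever rows the pipeline is fit on, which is what we need inside cross-validation.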
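We note the slight change made to the syntax inside our parameter grid: each parameter name is prefixed with the name of the pipeline step it belongs to, followed by a double underscore (here clf__).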
In [7]:
parameter_grid = {
'clf__max_features': [0.5, 1],
'clf__max_depth': [5, None],
}
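The step-name prefix can be checked directly on a pipeline object. The sketch below uses a throwaway pipeline (toy_pipeline is a hypothetical name, and SimpleImputer is the imputer class of recent scikit-learn releases) to list a prefixed parameter and set it.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

toy_pipeline = Pipeline([
    ('imp', SimpleImputer(strategy='mean')),
    ('clf', RandomForestClassifier(n_estimators=10)),
])

# Parameters of a step are exposed as '<step name>__<parameter name>'.
param_names = toy_pipeline.get_params().keys()
print('clf__max_depth' in param_names)

# Setting a prefixed parameter updates the underlying step.
toy_pipeline.set_params(clf__max_depth=5)
print(toy_pipeline.named_steps['clf'].max_depth)
```

GridSearchCV sets the values in the parameter grid through this same naming scheme, so a key without the prefix would not be recognised as a parameter of the pipeline.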
We now run GridSearchCV as before but replacing the classifier with our pipeline.
In [8]:
grid_search = GridSearchCV(pipeline, parameter_grid, cv=5, verbose=3)
In [9]:
# Features start at column 2: column 0 is Survived and column 1 is PassengerId.
grid_search.fit(train_data[0::,2::], train_data[0::,0])
Out[9]:
In [10]:
sorted(grid_search.grid_scores_, key=lambda x: x.mean_validation_score)
grid_search.best_score_
grid_search.best_params_
Out[10]:
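In recent scikit-learn releases the per-combination scores are exposed through the cv_results_ attribute of sklearn.model_selection.GridSearchCV rather than grid_scores_. The sketch below runs a small search on random made-up data and sorts the combinations by mean validation score. The data, the cv=3 setting, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random toy data; a few entries of column 1 are marked as missing with -1.
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = (X[:, 0] > 0.5).astype(int)
X[::7, 1] = -1

pipe = Pipeline([
    ('imp', SimpleImputer(strategy='mean', missing_values=-1)),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
grid = {'clf__max_features': [0.5, 1], 'clf__max_depth': [5, None]}
search = GridSearchCV(pipe, grid, cv=3)
search.fit(X, y)

# Pair each parameter combination with its mean validation score, lowest first.
results = sorted(
    zip(search.cv_results_['mean_test_score'], search.cv_results_['params']),
    key=lambda pair: pair[0],
)
for score, params in results:
    print(round(score, 3), params)
```

The last entry of results carries the same score as search.best_score_.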
Now that we've determined the desired values for our tuning parameters, we can fill in the -1 values in the column Age with the mean and train our model.
In [11]:
df['Age'].describe()
Out[11]:
In [12]:
df['Age'] = df['Age'].map(lambda x: age_mean if x == -1 else x)
In [13]:
df['Age'].describe()
Out[13]:
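The effect of the -1 marker on the summary statistics, and of replacing it with the mean, can be seen on a small made-up series. The values and variable names below are illustrative assumptions, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0])

# pandas skips NaN by default, so this is the mean of the known ages.
mean_without_missing = ages.mean()

# With the marker in place, the -1 entries pull the mean down.
marked = ages.fillna(-1)
mean_with_marker = marked.mean()

# Replacing the marker with the mean of the known ages restores that mean.
restored = marked.map(lambda x: mean_without_missing if x == -1 else x)

print(mean_without_missing, mean_with_marker, restored.mean())
```

This is why the mean is recorded before the marker is introduced, and why the statistics change between the two calls to describe().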
In [14]:
train_data = df.values
In [15]:
model = RandomForestClassifier(n_estimators=100, max_features=0.5, max_depth=5)
model = model.fit(train_data[0:,2:], train_data[0:,0])
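Instead of typing the tuned values by hand, the keys of best_params_ can be turned into classifier arguments by removing the step prefix. The helper strip_step_prefix below is a hypothetical name introduced for this sketch, and the dictionary holds illustrative values rather than the actual output of the grid search.

```python
def strip_step_prefix(params, step):
    """Keep the entries for one pipeline step and drop the 'step__' prefix."""
    prefix = step + '__'
    return {name[len(prefix):]: value
            for name, value in params.items()
            if name.startswith(prefix)}

# Illustrative values only; not the notebook's actual best_params_ output.
example_best_params = {'clf__max_features': 0.5, 'clf__max_depth': 5}
clf_kwargs = strip_step_prefix(example_best_params, 'clf')
print(clf_kwargs)
```

The resulting dictionary could then be passed on as RandomForestClassifier(n_estimators=100, **clf_kwargs).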
In [16]:
df_test = pd.read_csv('../data/test.csv')
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
We can fill in the NA values in the test data with the mean age computed from the training data, since there is no analogous problem of snooping.
In [17]:
df_test['Age'] = df_test['Age'].fillna(age_mean)
In [18]:
# Fill missing fares with the mean training fare of the passenger's class.
fare_means = df.groupby('Pclass')['Fare'].mean()
df_test['Fare'] = df_test['Fare'].fillna(df_test['Pclass'].map(fare_means))
df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')],
axis=1)
df_test = df_test.drop(['Sex', 'Embarked'], axis=1)
test_data = df_test.values
output = model.predict(test_data[:,1:])
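The class-wise fill of missing fares can be checked on a toy example. The sketch below is one way to write it with groupby, map and fillna; the two small frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up training and test frames (not the Titanic data).
train = pd.DataFrame({'Pclass': [1, 1, 3, 3], 'Fare': [80.0, 100.0, 8.0, 12.0]})
test = pd.DataFrame({'Pclass': [1, 3, 3], 'Fare': [np.nan, 7.5, np.nan]})

# Mean fare per class, computed from the training frame only.
class_means = train.groupby('Pclass')['Fare'].mean()

# Look up each test row's class mean and use it only where Fare is missing.
test['Fare'] = test['Fare'].fillna(test['Pclass'].map(class_means))
print(test)
```

Known fares are left untouched; only the missing entries receive the mean of their class.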
In [19]:
result = np.c_[test_data[:,0].astype(int), output.astype(int)]
df_result = pd.DataFrame(result[:,0:2], columns=['PassengerId', 'Survived'])
df_result.to_csv('../results/titanic_1-4.csv', index=False)